word alignment
TransAlign: Machine Translation Encoders are Strong Word Aligners, Too
Ebing, Benedikt, Goldschmied, Christian, Glavaš, Goran
In the absence of sizable training data for most world languages and NLP tasks, translation-based strategies such as translate-test -- evaluating on noisy source language data translated from the target language -- and translate-train -- training on noisy target language data translated from the source language -- have been established as competitive approaches for cross-lingual transfer (XLT). For token classification tasks, these strategies require label projection: mapping the labels from each token in the original sentence to its counterpart(s) in the translation. To this end, it is common to leverage multilingual word aligners (WAs) derived from encoder language models such as mBERT or LaBSE. Despite obvious associations between machine translation (MT) and WA, research on extracting alignments with MT models is largely limited to exploiting cross-attention in encoder-decoder architectures, yielding poor WA results. In this work, in contrast, we propose TransAlign, a novel word aligner that utilizes the encoder of a massively multilingual MT model. We show that TransAlign not only achieves strong WA performance but substantially outperforms popular WA and state-of-the-art non-WA-based label projection methods in MT-based XLT for token classification.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Austria > Vienna (0.14)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- (23 more...)
Bridging Language Gaps: Advances in Cross-Lingual Information Retrieval with Multilingual LLMs
Goworek, Roksana, Macmillan-Scott, Olivia, Özyiğit, Eda B.
Cross-lingual information retrieval (CLIR) addresses the challenge of retrieving relevant documents written in languages different from that of the original query. Research in this area has typically framed the task as monolingual retrieval augmented by translation, treating retrieval methods and cross-lingual capabilities in isolation. Both monolingual and cross-lingual retrieval usually follow a pipeline of query expansion, ranking, re-ranking and, increasingly, question answering. Recent advances, however, have shifted from translation-based methods toward embedding-based approaches and leverage multilingual large language models (LLMs), for which aligning representations across languages remains a central challenge. The emergence of cross-lingual embeddings and multilingual LLMs has introduced a new paradigm, offering improved retrieval performance and enabling answer generation. This survey provides a comprehensive overview of developments from early translation-based methods to state-of-the-art embedding-driven and generative techniques. It presents a structured account of core CLIR components, evaluation practices, and available resources. Persistent challenges such as data imbalance and linguistic variation are identified, while promising directions are suggested for advancing equitable and effective cross-lingual information retrieval. By situating CLIR within the broader landscape of information retrieval and multilingual language processing, this work not only reviews current capabilities but also outlines future directions for building retrieval systems that are robust, inclusive, and adaptable.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Washington > King County > Seattle (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- (44 more...)
- Research Report (1.00)
- Overview (1.00)
- Education (1.00)
- Government (0.67)
- Media > News (0.45)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
An Autoencoder Approach to Learning Bilingual Word Representations
Cross-language learning allows us to use training data from one language to build models for a different language. Many approaches to bilingual learning require that we have word-level alignment of sentences from parallel corpora. In this work we explore the use of autoencoder-based methods for cross-language learning of vectorial word representations that are aligned between two languages, while not relying on word-level alignments. We show that by simply learning to reconstruct the bag-of-words representations of aligned sentences, within and between languages, we can in fact learn high-quality representations and do without word alignments. We empirically investigate the success of our approach on the problem of cross-language text classification, where a classifier trained on a given language (e.g., English) must learn to generalize to a different language (e.g., German). In experiments on 3 language pairs, we show that our approach achieves state-of-the-art performance, outperforming a method exploiting word alignments and a strong machine translation baseline.
Whisper Has an Internal Word Aligner
Yeh, Sung-Lin, Meng, Yen, Tang, Hao
Abstract--There is an increasing interest in obtaining accurate word-level timestamps from strong automatic speech recognizers, in particular Whisper . Existing approaches either require additional training or are simply not competitive. The evaluation in prior work is also relatively loose, typically using a tolerance of more than 200 ms. In this work, we discover attention heads in Whisper that capture accurate word alignments and are distinctively different from those that do not. Moreover, we find that using characters produces finer and more accurate alignments than using wordpieces. Based on these findings, we propose an unsupervised approach to extracting word alignments by filtering attention heads while teacher forcing Whisper with characters. Our approach not only does not require training but also produces word alignments that are more accurate than prior work under a stricter tolerance between 20 ms and 100 ms.
- South America > Colombia > Meta Department > Villavicencio (0.04)
- North America > Canada > Quebec > Montreal (0.04)
MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines
Javorský, Dávid, Bojar, Ondřej, Yvon, François
In simultaneous interpreting, an interpreter renders a source speech into another language with a very short lag, much sooner than sentences are finished. In order to understand and later reproduce this dynamic and complex task automatically, we need dedicated datasets and tools for analysis, monitoring, and evaluation, such as parallel speech corpora, and tools for their automatic annotation. Existing parallel corpora of translated texts and associated alignment algorithms hardly fill this gap, as they fail to model long-range interactions between speech segments or specific types of divergences (e.g., shortening, simplification, functional generalization) between the original and interpreted speeches. In this work, we introduce MockConf, a student interpreting dataset that was collected from Mock Conferences run as part of the students' curriculum. This dataset contains 7 hours of recordings in 5 European languages, transcribed and aligned at the level of spans and words. We further implement and release InterAlign, a modern web-based annotation tool for parallel word and span annotations on long inputs, suitable for aligning simultaneous interpreting. We propose metrics for the evaluation and a baseline for automatic alignment. Dataset and tools are released to the community.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Bulgaria (0.05)
- (28 more...)
Conditional Random Field Autoencoders for Unsupervised Structured Prediction
Waleed Ammar, Chris Dyer, Noah A. Smith
We introduce a framework for unsupervised learning of structured predictors with overlapping, global features. Each input's latent representation is predicted conditional on the observed data using a feature-rich conditional random field (CRF). Then a reconstruction of the input is (re)generated, conditional on the latent structure, using a generative model which factorizes similarly to the CRF. The autoencoder formulation enables efficient exact inference without resorting to unrealistic independence assumptions or restricting the kinds of features that can be used. We illustrate connections to traditional autoencoders, posterior regularization, and multi-view learning. We then show competitive results with instantiations of the framework for two canonical tasks in natural language processing: part-of-speech induction and bitext word alignment, and show that training the proposed model can be substantially more efficient than a comparable feature-rich baseline.
An Autoencoder Approach to Learning Bilingual Word Representations
Cross-language learning allows us to use training data from one language to build models for a different language. Many approaches to bilingual learning require that we have word-level alignment of sentences from parallel corpora. In this work we explore the use of autoencoder-based methods for cross-language learning of vectorial word representations that are aligned between two languages, while not relying on word-level alignments. We show that by simply learning to reconstruct the bag-of-words representations of aligned sentences, within and between languages, we can in fact learn high-quality representations and do without word alignments. We empirically investigate the success of our approach on the problem of cross-language text classification, where a classifier trained on a given language (e.g., English) must learn to generalize to a different language (e.g., German).
BinaryAlign: Word Alignment as Binary Sequence Labeling
Latouche, Gaetan Lopez, Carbonneau, Marc-André, Swanson, Ben
Real world deployments of word alignment are almost certain to cover both high and low resource languages. However, the state-of-the-art for this task recommends a different model class depending on the availability of gold alignment training data for a particular language pair. We propose BinaryAlign, a novel word alignment technique based on binary sequence labeling that outperforms existing approaches in both scenarios, offering a unifying approach to the task. Additionally, we vary the specific choice of multilingual foundation model, perform stratified error analysis over alignment error type, and explore the performance of BinaryAlign on non-English language pairs. We make our source code publicly available.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Africa > Middle East > Egypt > Giza Governorate > Giza (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (7 more...)
A Word Order Synchronization Metric for Evaluating Simultaneous Interpretation and Translation
Makinae, Mana, Sudoh, Katsuhito, Yamada, Mararu, Nakamura, Satoshi
Simultaneous interpretation (SI), the translation of one language to another in real time, starts translation before the original speech has finished. Its evaluation needs to consider both latency and quality. This trade-off is challenging especially for distant word order language pairs such as English and Japanese. To handle this word order gap, interpreters maintain the word order of the source language as much as possible to keep up with original language to minimize its latency while maintaining its quality, whereas in translation reordering happens to keep fluency in the target language. This means outputs synchronized with the source language are desirable based on the real SI situation, and it's a key for further progress in computational SI and simultaneous machine translation (SiMT). In this work, we propose an automatic evaluation metric for SI and SiMT focusing on word order synchronization. Our evaluation metric is based on rank correlation coefficients, leveraging cross-lingual pre-trained language models. Our experimental results on NAIST-SIC-Aligned and JNPC showed our metrics' effectiveness to measure word order synchronization between source and target language.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Italy > Tuscany > Florence (0.05)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- (6 more...)
Cross-Lingual Word Alignment for ASEAN Languages with Contrastive Learning
Zhang, Jingshen, Qiu, Xinying, Shen, Teng, Wang, Wenyu, Zhang, Kailin, Feng, Wenhe
Cross-lingual word alignment plays a crucial role in various natural language processing tasks, particularly for low-resource languages. Recent study proposes a BiLSTM-based encoder-decoder model that outperforms pre-trained language models in low-resource settings. However, their model only considers the similarity of word embedding spaces and does not explicitly model the differences between word embeddings. To address this limitation, we propose incorporating contrastive learning into the BiLSTM-based encoder-decoder framework. Our approach introduces a multi-view negative sampling strategy to learn the differences between word pairs in the shared cross-lingual embedding space. We evaluate our model on five bilingual aligned datasets spanning four ASEAN languages: Lao, Vietnamese, Thai, and Indonesian. Experimental results demonstrate that integrating contrastive learning consistently improves word alignment accuracy across all datasets, confirming the effectiveness of the proposed method in low-resource scenarios. We will release our data set and code to support future research on ASEAN or more low-resource word alignment.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Africa > Middle East > Egypt > Giza Governorate > Giza (0.05)
- Asia > China > Guangdong Province > Guangzhou (0.04)
- (7 more...)
- Research Report > Experimental Study (0.68)
- Research Report > New Finding (0.48)